OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Textual summarisation of flowcharts in patent drawings for CLEF-IP 2012

Internal identifier: 000223 (Main/Exploration); previous: 000222; next: 000224

Authors: Andrew Thean [Germany]; Jean-Marc Deltorn [Germany]; Patrice Lopez [Germany]; Laurent Romary [Germany]

RBID : Hal:hal-00728779

Abstract

The CLEF-IP 2012 track included the Flowchart Recognition task, an image-based task where the goal was to process binary images of flowcharts taken from patent drawings to produce summaries containing information about their structure. The textual summaries include information about the flowchart title, the box-node shapes, the connecting edge types, text describing flowchart content and the structural relationships between nodes and edges.

An algorithm designed for this task and characterised by the following method steps is presented:

* Text-graphic segmentation based on connected-component clustering;
* Line segment bridging with an adaptive, oriented filter;
* Box shape classification using a stretch-invariant transform to extract features based on shape-specific symmetry;
* Text object recognition using a noisy channel model to enhance the results of a commercial OCR package.

Performance evaluation results for the CLEF-IP 2012 Flowchart Recognition task are not yet available, so the performance of the algorithm has been measured by comparing algorithm output with object-level ground-truth values. An average F-score was calculated by combining node classification and edge detection (ignoring edge directivity). Using this measure, a third of all drawings were recognized without error (average F-score = 1.00) and 75% show an F-score of 0.78 or better. The most important failure modes of the algorithm have been identified as text-graphic segmentation, line-segment bridging and edge directivity classification.

The text object recognition module of the algorithm has been independently evaluated. Two different state-of-the-art OCR software packages were compared and a post-correction method was applied to their output. Post-correction yields an improvement of 9% in OCR accuracy and a 26% reduction in the word error rate.
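The abstract does not say how node classification and edge detection were combined into one average F-score. The following is only a minimal sketch of one plausible reading, not the authors' evaluation code. It assumes set-based precision and recall over labelled nodes and over edges with direction dropped, then takes the arithmetic mean of the two F-scores. The function names, the node and edge representations, and the sample drawing are all hypothetical.

```python
from typing import FrozenSet, Hashable, Iterable, Set, Tuple


def f_score(predicted: Set[Hashable], truth: Set[Hashable]) -> float:
    """Harmonic mean of precision and recall over two sets of objects."""
    if not predicted and not truth:
        return 1.0  # nothing to find and nothing reported
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


def undirected(edges: Iterable[Tuple[str, str]]) -> Set[FrozenSet[str]]:
    """Drop edge direction so that (a, b) and (b, a) compare equal."""
    return {frozenset(edge) for edge in edges}


def drawing_f_score(pred_nodes, true_nodes, pred_edges, true_edges) -> float:
    """Assumed combination: mean of the node F-score and the edge F-score."""
    node_f = f_score(set(pred_nodes), set(true_nodes))
    edge_f = f_score(undirected(pred_edges), undirected(true_edges))
    return (node_f + edge_f) / 2


# Hypothetical drawing: one node shape is misclassified, one edge is reversed.
true_nodes = {("n1", "rectangle"), ("n2", "diamond"), ("n3", "rectangle")}
pred_nodes = {("n1", "rectangle"), ("n2", "rectangle"), ("n3", "rectangle")}
true_edges = [("n1", "n2"), ("n2", "n3")]
pred_edges = [("n2", "n1"), ("n2", "n3")]

score = drawing_f_score(pred_nodes, true_nodes, pred_edges, true_edges)
print(round(score, 2))  # 0.83
```

In this sketch the reversed edge is not penalised, because direction is ignored. The misclassified node shape lowers the node F-score to 2/3, so the drawing scores (2/3 + 1) / 2.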

The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Textual summarisation of flowcharts in patent drawings for CLEF-IP 2012</title>
<author>
<name sortKey="Thean, Andrew" sort="Thean, Andrew" uniqKey="Thean A" first="Andrew" last="Thean">Andrew Thean</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-95237" status="VALID">
<orgName>Institut für Deutsche Sprache und Linguistik</orgName>
<orgName type="acronym">IDSL</orgName>
<desc>
<address>
<addrLine>Dorotheenstraße 24, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">http://www.linguistik.hu-berlin.de/</ref>
</desc>
<listRelation>
<relation active="#struct-139189" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-139189" type="direct">
<org type="institution" xml:id="struct-139189" status="VALID">
<orgName>Humboldt Universität zu Berlin [Berlin]</orgName>
<desc>
<address>
<addrLine>Unter den Linden 6, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">https://www.hu-berlin.de/en/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Allemagne</country>
</affiliation>
</author>
<author>
<name sortKey="Deltorn, Jean Marc" sort="Deltorn, Jean Marc" uniqKey="Deltorn J" first="Jean-Marc" last="Deltorn">Jean-Marc Deltorn</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-95237" status="VALID">
<orgName>Institut für Deutsche Sprache und Linguistik</orgName>
<orgName type="acronym">IDSL</orgName>
<desc>
<address>
<addrLine>Dorotheenstraße 24, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">http://www.linguistik.hu-berlin.de/</ref>
</desc>
<listRelation>
<relation active="#struct-139189" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-139189" type="direct">
<org type="institution" xml:id="struct-139189" status="VALID">
<orgName>Humboldt Universität zu Berlin [Berlin]</orgName>
<desc>
<address>
<addrLine>Unter den Linden 6, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">https://www.hu-berlin.de/en/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Allemagne</country>
</affiliation>
</author>
<author>
<name sortKey="Lopez, Patrice" sort="Lopez, Patrice" uniqKey="Lopez P" first="Patrice" last="Lopez">Patrice Lopez</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-95237" status="VALID">
<orgName>Institut für Deutsche Sprache und Linguistik</orgName>
<orgName type="acronym">IDSL</orgName>
<desc>
<address>
<addrLine>Dorotheenstraße 24, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">http://www.linguistik.hu-berlin.de/</ref>
</desc>
<listRelation>
<relation active="#struct-139189" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-139189" type="direct">
<org type="institution" xml:id="struct-139189" status="VALID">
<orgName>Humboldt Universität zu Berlin [Berlin]</orgName>
<desc>
<address>
<addrLine>Unter den Linden 6, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">https://www.hu-berlin.de/en/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Allemagne</country>
</affiliation>
</author>
<author>
<name sortKey="Romary, Laurent" sort="Romary, Laurent" uniqKey="Romary L" first="Laurent" last="Romary">Laurent Romary</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-95237" status="VALID">
<orgName>Institut für Deutsche Sprache und Linguistik</orgName>
<orgName type="acronym">IDSL</orgName>
<desc>
<address>
<addrLine>Dorotheenstraße 24, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">http://www.linguistik.hu-berlin.de/</ref>
</desc>
<listRelation>
<relation active="#struct-139189" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-139189" type="direct">
<org type="institution" xml:id="struct-139189" status="VALID">
<orgName>Humboldt Universität zu Berlin [Berlin]</orgName>
<desc>
<address>
<addrLine>Unter den Linden 6, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">https://www.hu-berlin.de/en/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Allemagne</country>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:hal-00728779</idno>
<idno type="halId">hal-00728779</idno>
<idno type="halUri">https://hal.inria.fr/hal-00728779</idno>
<idno type="url">https://hal.inria.fr/hal-00728779</idno>
<date when="2012-09-17">2012-09-17</date>
<idno type="wicri:Area/Hal/Corpus">000120</idno>
<idno type="wicri:Area/Hal/Curation">000120</idno>
<idno type="wicri:Area/Hal/Checkpoint">000061</idno>
<idno type="wicri:Area/Main/Merge">000227</idno>
<idno type="wicri:Area/Main/Curation">000223</idno>
<idno type="wicri:Area/Main/Exploration">000223</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Textual summarisation of flowcharts in patent drawings for CLEF-IP 2012</title>
<author>
<name sortKey="Thean, Andrew" sort="Thean, Andrew" uniqKey="Thean A" first="Andrew" last="Thean">Andrew Thean</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-95237" status="VALID">
<orgName>Institut für Deutsche Sprache und Linguistik</orgName>
<orgName type="acronym">IDSL</orgName>
<desc>
<address>
<addrLine>Dorotheenstraße 24, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">http://www.linguistik.hu-berlin.de/</ref>
</desc>
<listRelation>
<relation active="#struct-139189" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-139189" type="direct">
<org type="institution" xml:id="struct-139189" status="VALID">
<orgName>Humboldt Universität zu Berlin [Berlin]</orgName>
<desc>
<address>
<addrLine>Unter den Linden 6, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">https://www.hu-berlin.de/en/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Allemagne</country>
</affiliation>
</author>
<author>
<name sortKey="Deltorn, Jean Marc" sort="Deltorn, Jean Marc" uniqKey="Deltorn J" first="Jean-Marc" last="Deltorn">Jean-Marc Deltorn</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-95237" status="VALID">
<orgName>Institut für Deutsche Sprache und Linguistik</orgName>
<orgName type="acronym">IDSL</orgName>
<desc>
<address>
<addrLine>Dorotheenstraße 24, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">http://www.linguistik.hu-berlin.de/</ref>
</desc>
<listRelation>
<relation active="#struct-139189" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-139189" type="direct">
<org type="institution" xml:id="struct-139189" status="VALID">
<orgName>Humboldt Universität zu Berlin [Berlin]</orgName>
<desc>
<address>
<addrLine>Unter den Linden 6, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">https://www.hu-berlin.de/en/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Allemagne</country>
</affiliation>
</author>
<author>
<name sortKey="Lopez, Patrice" sort="Lopez, Patrice" uniqKey="Lopez P" first="Patrice" last="Lopez">Patrice Lopez</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-95237" status="VALID">
<orgName>Institut für Deutsche Sprache und Linguistik</orgName>
<orgName type="acronym">IDSL</orgName>
<desc>
<address>
<addrLine>Dorotheenstraße 24, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">http://www.linguistik.hu-berlin.de/</ref>
</desc>
<listRelation>
<relation active="#struct-139189" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-139189" type="direct">
<org type="institution" xml:id="struct-139189" status="VALID">
<orgName>Humboldt Universität zu Berlin [Berlin]</orgName>
<desc>
<address>
<addrLine>Unter den Linden 6, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">https://www.hu-berlin.de/en/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Allemagne</country>
</affiliation>
</author>
<author>
<name sortKey="Romary, Laurent" sort="Romary, Laurent" uniqKey="Romary L" first="Laurent" last="Romary">Laurent Romary</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-95237" status="VALID">
<orgName>Institut für Deutsche Sprache und Linguistik</orgName>
<orgName type="acronym">IDSL</orgName>
<desc>
<address>
<addrLine>Dorotheenstraße 24, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">http://www.linguistik.hu-berlin.de/</ref>
</desc>
<listRelation>
<relation active="#struct-139189" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-139189" type="direct">
<org type="institution" xml:id="struct-139189" status="VALID">
<orgName>Humboldt Universität zu Berlin [Berlin]</orgName>
<desc>
<address>
<addrLine>Unter den Linden 6, 10099 Berlin</addrLine>
<country key="DE"></country>
</address>
<ref type="url">https://www.hu-berlin.de/en/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Allemagne</country>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass></textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">The CLEF-IP 2012 track included the Flowchart Recognition task, an image-based task where the goal was to process binary images of flowcharts taken from patent drawings to produce summaries containing information about their structure. The textual summaries include information about the flowchart title, the box-node shapes, the connecting edge types, text describing flowchart content and the structural relationships between nodes and edges. An algorithm designed for this task and characterised by the following method steps is presented: * Text-graphic segmentation based on connected-component clustering; * Line segment bridging with an adaptive, oriented filter; * Box shape classification using a stretch-invariant transform to extract features based on shape-specific symmetry; * Text object recognition using a noisy channel model to enhance the results of a commercial OCR package. Performance evaluation results for the CLEF-IP 2012 Flowchart Recognition task are not yet available, so the performance of the algorithm has been measured by comparing algorithm output with object-level ground-truth values. An average F-score was calculated by combining node classification and edge detection (ignoring edge directivity). Using this measure, a third of all drawings were recognized without error (average F-score = 1.00) and 75% show an F-score of 0.78 or better. The most important failure modes of the algorithm have been identified as text-graphic segmentation, line-segment bridging and edge directivity classification. The text object recognition module of the algorithm has been independently evaluated. Two different state-of-the-art OCR software packages were compared and a post-correction method was applied to their output. Post-correction yields an improvement of 9% in OCR accuracy and a 26% reduction in the word error rate.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Allemagne</li>
</country>
</list>
<tree>
<country name="Allemagne">
<noRegion>
<name sortKey="Thean, Andrew" sort="Thean, Andrew" uniqKey="Thean A" first="Andrew" last="Thean">Andrew Thean</name>
</noRegion>
<name sortKey="Deltorn, Jean Marc" sort="Deltorn, Jean Marc" uniqKey="Deltorn J" first="Jean-Marc" last="Deltorn">Jean-Marc Deltorn</name>
<name sortKey="Lopez, Patrice" sort="Lopez, Patrice" uniqKey="Lopez P" first="Patrice" last="Lopez">Patrice Lopez</name>
<name sortKey="Romary, Laurent" sort="Romary, Laurent" uniqKey="Romary L" first="Laurent" last="Romary">Laurent Romary</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000223 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000223 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Hal:hal-00728779
   |texte=   Textual summarisation of flowcharts in patent drawings for CLEF-IP 2012
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024